====================================================================================================================
====================================================================================================================
• DOMAIN: Semiconductor manufacturing process
• CONTEXT: A complex modern semiconductor manufacturing process is under constant surveillance via signals/variables collected from sensors and process measurement points. However, not all of these signals are equally valuable in a specific monitoring system: the measurements contain a combination of useful information, irrelevant information and noise, and engineers typically have far more signals than are actually required. If we treat each type of signal as a feature, feature selection can be applied to identify the most relevant signals. Process engineers may then use these signals to determine the key factors contributing to yield excursions downstream in the process. This enables increased process throughput, decreased time to learning and reduced per-unit production costs. The signals can be used as features to predict the yield type, and by analysing different combinations of features, the essential signals impacting the yield type can be identified.
• DATA DESCRIPTION: signal-data.csv : (1567, 592)
The data consists of 1567 examples, each with 590 sensor features plus a timestamp and a Pass/Fail label (592 columns in total).
The dataset represents a selection of such features: each example is a single production entity with its associated measured features, and the label is a simple pass/fail yield from in-house line testing.
In the target column, -1 corresponds to a pass and 1 corresponds to a fail, and the timestamp is for that specific test point.
• PROJECT OBJECTIVE: We will build a classifier to predict the Pass/Fail yield of a particular process entity and analyse whether all the features are required to build the model or not.
# Import libraries
# To enable plotting graphs in Jupyter notebook
%matplotlib inline

import os
import pickle
import warnings
from collections import Counter
from os import system

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import cm
from IPython.display import Image, display

from pandas.api.types import is_numeric_dtype
from scipy import stats
from scipy.stats import zscore
from scipy.spatial.distance import cdist, pdist  # pairwise distances between data points
from scipy.cluster.hierarchy import cophenet, dendrogram, fcluster, linkage

from sklearn import metrics, model_selection, svm
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso, LinearRegression, LogisticRegression, Ridge
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                             f1_score, precision_score, recall_score, roc_auc_score,
                             roc_curve, silhouette_samples, silhouette_score)
from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV, StratifiedKFold,
                                     cross_val_score, cross_validate, train_test_split)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import (FunctionTransformer, LabelEncoder, MinMaxScaler,
                                   OneHotEncoder, StandardScaler)
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier, export_graphviz, plot_tree

from imblearn.combine import SMOTETomek
from imblearn.over_sampling import RandomOverSampler, SMOTE, SMOTENC
from imblearn.pipeline import Pipeline

import xgboost as xgb
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

sns.set(color_codes=True)  # adds a nice background to the graphs
sns.set_style('darkgrid')
warnings.filterwarnings("ignore")
df1 = pd.read_csv("signal-data.csv")
df1.shape
(1567, 592)
Here we have 1567 rows and 592 columns
df1.head(3)
| Time | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | 581 | 582 | 583 | 584 | 585 | 586 | 587 | 588 | 589 | Pass/Fail | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2008-07-19 11:55:00 | 3030.93 | 2564.00 | 2187.7333 | 1411.1265 | 1.3602 | 100.0 | 97.6133 | 0.1242 | 1.5005 | ... | NaN | 0.5005 | 0.0118 | 0.0035 | 2.3630 | NaN | NaN | NaN | NaN | -1 |
| 1 | 2008-07-19 12:32:00 | 3095.78 | 2465.14 | 2230.4222 | 1463.6606 | 0.8294 | 100.0 | 102.3433 | 0.1247 | 1.4966 | ... | 208.2045 | 0.5019 | 0.0223 | 0.0055 | 4.4447 | 0.0096 | 0.0201 | 0.0060 | 208.2045 | -1 |
| 2 | 2008-07-19 13:17:00 | 2932.61 | 2559.94 | 2186.4111 | 1698.0172 | 1.5102 | 100.0 | 95.4878 | 0.1241 | 1.4436 | ... | 82.8602 | 0.4958 | 0.0157 | 0.0039 | 3.1745 | 0.0584 | 0.0484 | 0.0148 | 82.8602 | 1 |
3 rows × 592 columns
df1.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1567 entries, 0 to 1566 Columns: 592 entries, Time to Pass/Fail dtypes: float64(590), int64(1), object(1) memory usage: 7.1+ MB
• Missing value treatment.
• Drop attribute/s if required using relevant functional knowledge.
• Make all relevant modifications on the data using both functional/logical reasoning/assumptions
• Missing value treatment.
df1.isnull().sum().sum()
41951
We have 41,951 missing values, which is a very large number for 1567 rows
df1.isnull().sum().max()
1429
The maximum number of nulls in a single column is 1429.
for column in df1.columns:
    null_val = df1[column].isnull().sum()
    if null_val > (1567 / 4):
        print(null_val)
794 794 1341 1018 1018 1018 715 1429 1429 1341 1018 1018 1018 715 1429 1429 794 794 1341 1018 1018 1018 715 1341 1018 1018 1018 715 949 949 949 949
We have many columns with large numbers of missing values.
We will drop the columns with more than 25% missing values and replace the remaining NaN values with 0, since the values correspond to test results: an absent value is taken to mean no signal was recorded.
Under that assumption, imputing with the median or mean would be misleading, so we replace missing values with zeros instead.
missing_value_columns = df1.columns[df1.isna().mean() >= 0.25]
df2 = df1.drop(missing_value_columns, axis=1)
df2.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1567 entries, 0 to 1566 Columns: 560 entries, Time to Pass/Fail dtypes: float64(558), int64(1), object(1) memory usage: 6.7+ MB
df2 = df2.replace(np.NaN, 0)
# again, checking whether any null values remain
df2.isnull().any().any()
False
We have dropped the columns with a very high number of missing values and replaced the remaining NaNs with 0.
• Drop attribute/s if required using relevant functional knowledge.
• Make all relevant modifications on the data using both functional/logical reasoning/assumptions
df2.describe()
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 577 | 582 | 583 | 584 | 585 | 586 | 587 | 588 | 589 | Pass/Fail | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1567.000000 | 1567.000000 | 1567.000000 | 1567.000000 | 1567.000000 | 1567.000000 | 1567.000000 | 1567.000000 | 1567.000000 | 1567.000000 | ... | 1567.000000 | 1567.000000 | 1567.000000 | 1567.000000 | 1567.000000 | 1567.000000 | 1567.000000 | 1567.000000 | 1567.000000 | 1567.000000 |
| mean | 3002.910638 | 2484.700932 | 2180.887035 | 1383.901023 | 4.159516 | 99.106573 | 100.209538 | 0.121122 | 1.460995 | -0.000840 | ... | 16.642363 | 0.499777 | 0.015308 | 0.003844 | 3.065869 | 0.021445 | 0.016464 | 0.005280 | 99.606461 | -0.867262 |
| std | 200.204648 | 184.815753 | 209.206773 | 458.937272 | 56.104457 | 9.412812 | 11.363940 | 0.012831 | 0.090461 | 0.015107 | ... | 12.485267 | 0.013084 | 0.017179 | 0.003721 | 3.577730 | 0.012366 | 0.008815 | 0.002869 | 93.895701 | 0.498010 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -0.053400 | ... | 4.582000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -0.016900 | 0.000000 | 0.000000 | 0.000000 | -1.000000 |
| 25% | 2965.670000 | 2451.515000 | 2180.700000 | 1080.116050 | 1.011000 | 100.000000 | 97.762200 | 0.121100 | 1.410950 | -0.010800 | ... | 11.501550 | 0.497900 | 0.011600 | 0.003100 | 2.306200 | 0.013400 | 0.010600 | 0.003300 | 44.368600 | -1.000000 |
| 50% | 3010.920000 | 2498.910000 | 2200.955600 | 1283.436800 | 1.310100 | 100.000000 | 101.492200 | 0.122400 | 1.461500 | -0.001300 | ... | 13.817900 | 0.500200 | 0.013800 | 0.003600 | 2.757600 | 0.020500 | 0.014800 | 0.004600 | 71.778000 | -1.000000 |
| 75% | 3056.540000 | 2538.745000 | 2218.055500 | 1590.169900 | 1.518800 | 100.000000 | 104.530000 | 0.123800 | 1.516850 | 0.008400 | ... | 17.080900 | 0.502350 | 0.016500 | 0.004100 | 3.294950 | 0.027600 | 0.020300 | 0.006400 | 114.749700 | -1.000000 |
| max | 3356.350000 | 2846.440000 | 2315.266700 | 3715.041700 | 1114.536600 | 100.000000 | 129.252200 | 0.128600 | 1.656400 | 0.074900 | ... | 96.960100 | 0.509800 | 0.476600 | 0.104500 | 99.303200 | 0.102800 | 0.079900 | 0.028600 | 737.304800 | 1.000000 |
8 rows × 559 columns
df2["Pass/Fail"].dtype
dtype('int64')
plt.rcParams['figure.figsize'] = (18, 18)
sns.heatmap(df2.corr(), cmap = "YlGnBu")
plt.title('Correlation heatmap for the Data', fontsize = 20)
Text(0.5, 1.0, 'Correlation heatmap for the Data')
We can see that there are several highly correlated features present. We can therefore remove one feature from each correlated pair.
def remove_collinear_features(x, threshold):
    # Calculate the correlation matrix
    corr_matrix = x.corr()
    drop_cols = []
    # Walk the upper triangle so each pair is compared exactly once
    for i in range(len(corr_matrix.columns) - 1):
        for j in range(i + 1):
            val = abs(corr_matrix.iloc[j, i + 1])
            # If the correlation exceeds the threshold
            if val >= threshold:
                col = corr_matrix.columns[i + 1]
                row = corr_matrix.index[j]
                # Print the correlated features and the correlation value
                print(col, "|", row, "|", round(val, 2))
                drop_cols.append(col)
    # Drop one column from each pair of correlated columns
    x = x.drop(columns=set(drop_cols))
    return x
# Remove columns with absolute correlation of 0.70 or higher
# (both positive and negative correlations are considered)
data = remove_collinear_features(df2,0.70)
(Output truncated: the function prints every correlated pair with |r| ≥ 0.70, from `5 | 2 | 0.99` through `588 | 587 | 0.97` — several hundred pairs in total.)
# dropping the Time column, since timestamps are not used as features
data = data.drop(columns = ['Time'], axis = 1)
# checking the shape of the data after deleting a column
data.shape
(1567, 297)
data.head()
| 0 | 1 | 2 | 3 | 4 | 8 | 9 | 10 | 11 | 13 | ... | 565 | 570 | 571 | 572 | 582 | 583 | 586 | 587 | 589 | Pass/Fail | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3030.93 | 2564.00 | 2187.7333 | 1411.1265 | 1.3602 | 1.5005 | 0.0162 | -0.0034 | 0.9455 | 0.0 | ... | 0.0000 | 533.8500 | 2.1113 | 8.95 | 0.5005 | 0.0118 | 0.0000 | 0.0000 | 0.0000 | -1 |
| 1 | 3095.78 | 2465.14 | 2230.4222 | 1463.6606 | 0.8294 | 1.4966 | -0.0005 | -0.0148 | 0.9627 | 0.0 | ... | 0.0000 | 535.0164 | 2.4335 | 5.92 | 0.5019 | 0.0223 | 0.0096 | 0.0201 | 208.2045 | -1 |
| 2 | 2932.61 | 2559.94 | 2186.4111 | 1698.0172 | 1.5102 | 1.4436 | 0.0041 | 0.0013 | 0.9615 | 0.0 | ... | 0.6219 | 535.0245 | 2.0293 | 11.21 | 0.4958 | 0.0157 | 0.0584 | 0.0484 | 82.8602 | 1 |
| 3 | 2988.72 | 2479.90 | 2199.0333 | 909.7926 | 1.3204 | 1.4882 | -0.0124 | -0.0033 | 0.9629 | 0.0 | ... | 0.1630 | 530.5682 | 2.0253 | 9.33 | 0.4990 | 0.0103 | 0.0202 | 0.0149 | 73.8432 | -1 |
| 4 | 3032.24 | 2502.87 | 2233.3667 | 1326.5200 | 1.5334 | 1.5031 | -0.0031 | -0.0072 | 0.9569 | 0.0 | ... | 0.0000 | 532.0155 | 2.0275 | 8.83 | 0.4800 | 0.4766 | 0.0202 | 0.0149 | 73.8432 | -1 |
5 rows × 297 columns
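The pairwise scan in `remove_collinear_features` above can also be written with an upper-triangle mask, which avoids the nested index bookkeeping. A minimal sketch on a tiny synthetic frame (the `drop_collinear` helper and the demo data are illustrative, not part of the notebook):

```python
import numpy as np
import pandas as pd

def drop_collinear(df, threshold=0.70):
    """Drop one column from every pair whose |correlation| >= threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] >= threshold).any()]
    return df.drop(columns=to_drop)

# Tiny check: column "b" duplicates "a" (r = 1.0) and should be dropped
demo = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                     "b": [2.0, 4.0, 6.0, 8.0],
                     "c": [4.0, 1.0, 3.0, 2.0]})
print(drop_collinear(demo).columns.tolist())  # ['a', 'c']
```

As in the notebook's function, the later column of each correlated pair is the one removed.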
• Perform detailed relevant statistical analysis on the data.
• Perform a detailed univariate, bivariate and multivariate analysis with appropriate detailed comments after each analysis.
data.describe()
| 0 | 1 | 2 | 3 | 4 | 8 | 9 | 10 | 11 | 13 | ... | 565 | 570 | 571 | 572 | 582 | 583 | 586 | 587 | 589 | Pass/Fail | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1567.000000 | 1567.000000 | 1567.000000 | 1567.000000 | 1567.000000 | 1567.000000 | 1567.000000 | 1567.000000 | 1567.000000 | 1567.0 | ... | 1567.000000 | 1567.000000 | 1567.000000 | 1567.000000 | 1567.000000 | 1567.000000 | 1567.000000 | 1567.000000 | 1567.000000 | 1567.000000 |
| mean | 3002.910638 | 2484.700932 | 2180.887035 | 1383.901023 | 4.159516 | 1.460995 | -0.000840 | 0.000146 | 0.963122 | 0.0 | ... | 0.120242 | 530.523623 | 2.101836 | 28.450165 | 0.499777 | 0.015308 | 0.021445 | 0.016464 | 99.606461 | -0.867262 |
| std | 200.204648 | 184.815753 | 209.206773 | 458.937272 | 56.104457 | 0.090461 | 0.015107 | 0.009296 | 0.036620 | 0.0 | ... | 0.092119 | 17.499736 | 0.275112 | 86.304681 | 0.013084 | 0.017179 | 0.012366 | 0.008815 | 93.895701 | 0.498010 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -0.053400 | -0.034900 | 0.000000 | 0.0 | ... | 0.000000 | 317.196400 | 0.980200 | 3.540000 | 0.000000 | 0.000000 | -0.016900 | 0.000000 | 0.000000 | -1.000000 |
| 25% | 2965.670000 | 2451.515000 | 2180.700000 | 1080.116050 | 1.011000 | 1.410950 | -0.010800 | -0.005600 | 0.958000 | 0.0 | ... | 0.087700 | 530.702700 | 1.982900 | 7.500000 | 0.497900 | 0.011600 | 0.013400 | 0.010600 | 44.368600 | -1.000000 |
| 50% | 3010.920000 | 2498.910000 | 2200.955600 | 1283.436800 | 1.310100 | 1.461500 | -0.001300 | 0.000400 | 0.965800 | 0.0 | ... | 0.090300 | 532.398200 | 2.118600 | 8.650000 | 0.500200 | 0.013800 | 0.020500 | 0.014800 | 71.778000 | -1.000000 |
| 75% | 3056.540000 | 2538.745000 | 2218.055500 | 1590.169900 | 1.518800 | 1.516850 | 0.008400 | 0.005900 | 0.971300 | 0.0 | ... | 0.166850 | 534.356400 | 2.290650 | 10.130000 | 0.502350 | 0.016500 | 0.027600 | 0.020300 | 114.749700 | -1.000000 |
| max | 3356.350000 | 2846.440000 | 2315.266700 | 3715.041700 | 1114.536600 | 1.656400 | 0.074900 | 0.053000 | 0.984800 | 0.0 | ... | 0.689200 | 589.508200 | 2.739500 | 454.560000 | 0.509800 | 0.476600 | 0.102800 | 0.079900 | 737.304800 | 1.000000 |
8 rows × 297 columns
The features are on very diverse scales and are unscaled.
The target is imbalanced: there are far more -1 (pass) examples than 1 (fail) examples.
There are 1567 rows in total.
The standard deviations vary widely across columns.
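One quick way to check whether features share a common scale is to compare per-column standard deviations; a minimal sketch using a synthetic stand-in for two of the sensor columns (the real notebook would call `data.std()`):

```python
import pandas as pd

# Synthetic stand-in for two columns of `data` with very different scales
demo = pd.DataFrame({"0": [3030.9, 3095.8, 2932.6, 2988.7],
                     "582": [0.5005, 0.5019, 0.4958, 0.4990]})
# Widely different standard deviations indicate scaling is needed
print(demo.std())
```

Ratios of hundreds or thousands between column standard deviations, as seen in `describe()`, are why standardisation is applied before modelling.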
Let's analyse the pass/fail criteria.
def univariate_analysis_piechart_bargraph(dataset, criteria):
    f, axes = plt.subplots(1, 2, figsize=(17, 7))
    dataset[criteria].value_counts().plot.pie(autopct='%1.1f%%', ax=axes[0], shadow=True)
    sns.countplot(x=criteria, data=dataset, ax=axes[1])
    axes[0].set_title(f'{criteria} Variable Pie Chart')
    axes[1].set_title(f'{criteria} Variable Bar Graph')
    plt.show()
univariate_analysis_piechart_bargraph(data, "Pass/Fail")
Here we can see that the data is highly imbalanced.
In the target column, -1 corresponds to a pass and 1 corresponds to a fail.
We have a large number of pass examples and very few fail examples.
def univariate_analysis_boxplot_distplot(dataset, criteria):
    f, axes = plt.subplots(1, 2, figsize=(17, 7))
    sns.boxplot(x=criteria, data=dataset, orient='h', ax=axes[1])
    sns.histplot(dataset[criteria], kde=True, ax=axes[0])  # distplot is deprecated in recent seaborn
    axes[0].set_title('Distribution plot')
    axes[1].set_title('Box plot')
    plt.show()
    # Checking the count of outliers with the 1.5 * IQR rule
    q25, q75 = np.percentile(dataset[criteria], 25), np.percentile(dataset[criteria], 75)
    IQR = q75 - q25
    Threshold = IQR * 1.5
    lower, upper = q25 - Threshold, q75 + Threshold
    Outliers = [i for i in dataset[criteria] if i < lower or i > upper]
    print(f'Total Number of outliers in {criteria}: {len(Outliers)}')
univariate_analysis_boxplot_distplot(data, "0")
Total Number of outliers in 0: 55
univariate_analysis_boxplot_distplot(data, "1")
Total Number of outliers in 1: 91
univariate_analysis_boxplot_distplot(data, "2")
Total Number of outliers in 2: 39
univariate_analysis_boxplot_distplot(data, "4")
Total Number of outliers in 4: 62
As we can see, there is a large amount of skewness and there are many outliers in the data.
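The per-column outlier count computed inside the plotting helper can also be vectorised over every column at once. A sketch on a small synthetic frame (the `iqr_outlier_counts` helper is illustrative, not from the notebook):

```python
import pandas as pd

def iqr_outlier_counts(df, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for every column."""
    q1, q3 = df.quantile(0.25), df.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Comparisons align the bound Series against columns; sum counts True values
    return (df.lt(lower) | df.gt(upper)).sum()

demo = pd.DataFrame({"a": [1, 2, 3, 4, 100],   # 100 is an outlier
                     "b": [10, 11, 12, 13, 14]})  # no outliers
print(iqr_outlier_counts(demo))
```

Applied to the full `data` frame, this gives the outlier count for all 296 features in one call instead of one plot at a time.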
def bivariate_boxplot_bargraph(data, criteria, criteria2):
    f, axes = plt.subplots(1, 2, figsize=(17, 7))
    sns.boxplot(x=criteria, y=criteria2, data=data, ax=axes[0])
    sns.barplot(x=criteria, y=criteria2, data=data, ax=axes[1])
    axes[0].set_title(f'{criteria2} Box Plot by {criteria}')
    axes[1].set_title(f'{criteria2} Bar Graph by {criteria}')
    plt.show()
bivariate_boxplot_bargraph(data, "Pass/Fail", "0")
bivariate_boxplot_bargraph(data, "Pass/Fail", "1")
bivariate_boxplot_bargraph(data, "Pass/Fail", "2")
bivariate_boxplot_bargraph(data, "Pass/Fail", "4")
As we can see, many of the columns do not separate passes from fails well:
the distributions overlap heavily for the two classes.
data.columns
Index(['0', '1', '2', '3', '4', '8', '9', '10', '11', '13',
...
'565', '570', '571', '572', '582', '583', '586', '587', '589',
'Pass/Fail'],
dtype='object', length=297)
df_pairplot= data.iloc[:,:10]
sns.pairplot(df_pairplot,diag_kind='hist',corner=True)
<seaborn.axisgrid.PairGrid at 0x7fa4df7d5340>
plt.rcParams['figure.figsize'] = (18, 18)
sns.heatmap(data.corr(), cmap = "YlGnBu")
plt.title('Correlation heatmap for the Data', fontsize = 20)
Text(0.5, 1.0, 'Correlation heatmap for the Data')
Since we have already removed the highly correlated features, much less correlation remains in the data.
• Segregate predictors vs target attributes
• Check for target balancing and fix it if found imbalanced.
• Perform train-test split and standardise the data or vice versa if required.
• Check if the train and test data have similar statistical characteristics when compared with original data.
x = data.iloc[:,:296]
y = data["Pass/Fail"]
# getting the shapes of new data sets x and y
print("shape of x:", x.shape)
print("shape of y:", y.shape)
shape of x: (1567, 296) shape of y: (1567,)
• Check for target balancing and fix it if found imbalanced.
univariate_analysis_piechart_bargraph(data, "Pass/Fail")
smote=SMOTE(random_state=42) # all features are numeric, so plain SMOTE (not SMOTENC with categorical column indices) is used
x_s,y_s=smote.fit_resample(x,y)
print('Before sampling:')
print(y.value_counts())
Before sampling:
-1    1463
 1     104
Name: Pass/Fail, dtype: int64
print('After sampling:')
print(y_s.value_counts())
After sampling:
-1    1463
 1    1463
Name: Pass/Fail, dtype: int64
x_s.isnull().sum().sum()
0
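Here SMOTE is applied to the full dataset before the train-test split. A common precaution is to split first and oversample only the training rows, so that synthetic neighbours of test rows never leak into training. A minimal sketch on synthetic data, using plain random oversampling as a simpler stand-in for SMOTE:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.RandomState(42)
X = rng.normal(size=(200, 5))
y = np.array([-1] * 180 + [1] * 20)   # imbalanced target, like Pass/Fail

# split FIRST, then oversample only the training rows
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

minority = X_tr[y_tr == 1]
extra = resample(minority, replace=True,
                 n_samples=(y_tr == -1).sum() - (y_tr == 1).sum(),
                 random_state=42)
X_bal = np.vstack([X_tr, extra])
y_bal = np.concatenate([y_tr, np.ones(len(extra), dtype=int)])

# the training classes are now balanced, while the test set is untouched
print((y_bal == -1).sum(), (y_bal == 1).sum())
```

The same pattern applies with SMOTE itself (e.g. via an imblearn pipeline, which resamples only the training folds during cross-validation).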
• Perform train-test split and standardise the data or vice versa if required.
# splitting the data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 1)
# getting the shapes
print("Un-Sampled data")
print("shape of x_train: ", x_train.shape)
print("shape of x_test: ", x_test.shape)
print("shape of y_train: ", y_train.shape)
print("shape of y_test: ", y_test.shape)
x_train_s, x_test_s, y_train_s, y_test_s = train_test_split(x_s, y_s, test_size = 0.3, random_state = 1)
# getting the shapes
print("Sampled data")
print("shape of x_train: ", x_train_s.shape)
print("shape of x_test: ", x_test_s.shape)
print("shape of y_train: ", y_train_s.shape)
print("shape of y_test: ", y_test_s.shape)
Un-Sampled data
shape of x_train:  (1096, 296)
shape of x_test:  (471, 296)
shape of y_train:  (1096,)
shape of y_test:  (471,)
Sampled data
shape of x_train:  (2048, 296)
shape of x_test:  (878, 296)
shape of y_train:  (2048,)
shape of y_test:  (878,)
# standardization
# creating a standard scaler
sc = StandardScaler()
# fitting independent data to the model
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
# Scaling is included in the model pipeline, hence we won't be scaling the sampled data here
# x_train_s = sc.fit_transform(x_train_s)
# x_test_s = sc.transform(x_test_s)
Scaling for the sampled data is handled inside the model pipeline, so it is not applied separately here.
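Folding the scaler into a pipeline, as done below, means it is re-fit on each training fold during cross-validation, so no statistics from the validation fold leak into scaling. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=42)

pipe = Pipeline([("scaling", StandardScaler()),
                 ("classification", LogisticRegression(max_iter=1000))])

# the scaler is fit only on the training portion of each fold
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```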
• Check if the train and test data have similar statistical characteristics when compared with original data.
# getting the shapes
print("Un-Sampled data")
print("shape of x_train: ", x_train.shape)
print("shape of x_test: ", x_test.shape)
print("shape of y_train: ", y_train.shape)
print("shape of y_test: ", y_test.shape)
# getting the shapes
print("Sampled data")
print("shape of x_train: ", x_train_s.shape)
print("shape of x_test: ", x_test_s.shape)
print("shape of y_train: ", y_train_s.shape)
print("shape of y_test: ", y_test_s.shape)
Un-Sampled data
shape of x_train:  (1096, 296)
shape of x_test:  (471, 296)
shape of y_train:  (1096,)
shape of y_test:  (471,)
Sampled data
shape of x_train:  (2048, 296)
shape of x_test:  (878, 296)
shape of y_train:  (2048,)
shape of y_test:  (878,)
print('y_train_s sampling:')
print(y_train_s.value_counts())
print('y_test_s sampling:')
print(y_test_s.value_counts())
y_train_s sampling:
 1    1029
-1    1019
Name: Pass/Fail, dtype: int64
y_test_s sampling:
-1    444
 1    434
Name: Pass/Fail, dtype: int64
x_s.describe()
| 0 | 1 | 2 | 3 | 4 | 8 | 9 | 10 | 11 | 13 | ... | 562 | 565 | 570 | 571 | 572 | 582 | 583 | 586 | 587 | 589 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2926.000000 | 2926.000000 | 2926.000000 | 2926.000000 | 2926.000000 | 2926.000000 | 2926.000000 | 2926.000000 | 2926.000000 | 2926.0 | ... | 2926.000000 | 2926.000000 | 2926.000000 | 2926.000000 | 2926.000000 | 2926.000000 | 2926.000000 | 2926.000000 | 2926.000000 | 2926.000000 |
| mean | 3001.776436 | 2482.096537 | 2189.673053 | 1370.385270 | 2.827280 | 1.466965 | -0.001926 | 0.000220 | 0.962659 | 0.0 | ... | 202.202233 | 0.118531 | 529.125924 | 2.100473 | 23.715513 | 0.500064 | 0.015376 | 0.021585 | 0.017026 | 99.154098 |
| std | 153.204417 | 181.264104 | 154.267275 | 387.293332 | 41.076926 | 0.074023 | 0.013526 | 0.008514 | 0.027289 | 0.0 | ... | 100.486368 | 0.094189 | 22.353654 | 0.285849 | 69.929007 | 0.009803 | 0.013091 | 0.010970 | 0.008288 | 80.485741 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -0.053400 | -0.034900 | 0.000000 | 0.0 | ... | 0.000000 | 0.000000 | 317.196400 | 0.980200 | 3.540000 | 0.000000 | 0.000000 | -0.016900 | 0.000000 | 0.000000 |
| 25% | 2961.334528 | 2453.491308 | 2181.188900 | 1104.806654 | 1.064700 | 1.428011 | -0.010488 | -0.004800 | 0.957200 | 0.0 | ... | 156.338006 | 0.064639 | 530.899400 | 1.990425 | 7.645731 | 0.498200 | 0.011700 | 0.014710 | 0.011300 | 50.600975 |
| 50% | 2997.795624 | 2497.580000 | 2198.485674 | 1288.085700 | 1.303746 | 1.467000 | -0.002285 | 0.000901 | 0.964100 | 0.0 | ... | 260.211248 | 0.095605 | 532.420877 | 2.141700 | 8.798509 | 0.500338 | 0.013900 | 0.020845 | 0.015739 | 76.615926 |
| 75% | 3048.355000 | 2534.371113 | 2215.843367 | 1579.879826 | 1.485942 | 1.511108 | 0.005855 | 0.005670 | 0.969600 | 0.0 | ... | 264.272000 | 0.166475 | 534.150000 | 2.295568 | 10.089193 | 0.502324 | 0.016700 | 0.027500 | 0.020915 | 117.900882 |
| max | 3356.350000 | 2846.440000 | 2315.266700 | 3715.041700 | 1114.536600 | 1.656400 | 0.074900 | 0.053000 | 0.984800 | 0.0 | ... | 311.404000 | 0.689200 | 589.508200 | 2.739500 | 454.560000 | 0.509800 | 0.476600 | 0.102800 | 0.079900 | 737.304800 |
8 rows × 296 columns
pd.DataFrame(x_train).describe()
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 286 | 287 | 288 | 289 | 290 | 291 | 292 | 293 | 294 | 295 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.096000e+03 | 1.096000e+03 | 1.096000e+03 | 1.096000e+03 | 1.096000e+03 | 1.096000e+03 | 1.096000e+03 | 1.096000e+03 | 1.096000e+03 | 1096.0 | ... | 1.096000e+03 | 1.096000e+03 | 1.096000e+03 | 1.096000e+03 | 1.096000e+03 | 1.096000e+03 | 1.096000e+03 | 1.096000e+03 | 1.096000e+03 | 1.096000e+03 |
| mean | -1.825790e-15 | -2.762186e-15 | -9.245696e-16 | -1.418168e-17 | -4.460898e-17 | 9.152756e-16 | -4.376062e-17 | 5.409298e-17 | 1.468159e-15 | 0.0 | ... | 5.405753e-16 | -1.519972e-16 | -1.695623e-15 | 1.381802e-15 | -3.558082e-17 | 4.562728e-15 | -1.021777e-16 | 3.159729e-16 | -1.489077e-16 | 1.120353e-16 |
| std | 1.000457e+00 | 1.000457e+00 | 1.000457e+00 | 1.000457e+00 | 1.000457e+00 | 1.000457e+00 | 1.000457e+00 | 1.000457e+00 | 1.000457e+00 | 0.0 | ... | 1.000457e+00 | 1.000457e+00 | 1.000457e+00 | 1.000457e+00 | 1.000457e+00 | 1.000457e+00 | 1.000457e+00 | 1.000457e+00 | 1.000457e+00 | 1.000457e+00 |
| min | -1.535479e+01 | -1.336611e+01 | -1.152217e+01 | -3.068868e+00 | -7.538250e-02 | -1.512694e+01 | -3.418629e+00 | -3.706935e+00 | -2.220206e+01 | 0.0 | ... | -2.118332e+00 | -1.300661e+00 | -1.286803e+01 | -4.021080e+00 | -2.836542e-01 | -3.228820e+01 | -1.018719e+00 | -2.204840e+00 | -1.788063e+00 | -1.045517e+00 |
| 25% | -1.984389e-01 | -1.831429e-01 | -1.894175e-02 | -6.754960e-01 | -5.781365e-02 | -5.158039e-01 | -6.542373e-01 | -6.153878e-01 | -1.009546e-01 | 0.0 | ... | 3.480628e-01 | -3.674868e-01 | -1.456058e-02 | -4.285942e-01 | -2.365849e-01 | -1.135806e-01 | -2.380938e-01 | -6.620602e-01 | -6.657830e-01 | -5.909975e-01 |
| 50% | 4.299285e-02 | 7.521050e-02 | 8.401211e-02 | -2.196680e-01 | -5.269689e-02 | 1.465393e-02 | -3.956676e-02 | 4.466882e-02 | 7.781960e-02 | 0.0 | ... | 4.865617e-01 | -3.240834e-01 | 9.888759e-02 | 5.758119e-02 | -2.218889e-01 | 2.861335e-02 | -9.340902e-02 | -9.051233e-02 | -1.909720e-01 | -3.052279e-01 |
| 75% | 2.786106e-01 | 2.897631e-01 | 1.767830e-01 | 4.497790e-01 | -4.904581e-02 | 5.831495e-01 | 6.141305e-01 | 6.483104e-01 | 2.012315e-01 | 0.0 | ... | 4.900855e-01 | 5.426292e-01 | 2.233755e-01 | 6.728941e-01 | -2.048378e-01 | 1.708073e-01 | 9.165296e-02 | 5.050165e-01 | 4.484041e-01 | 1.707319e-01 |
| max | 1.424907e+00 | 1.750319e+00 | 6.312036e-01 | 4.949082e+00 | 1.911474e+01 | 2.017456e+00 | 4.926581e+00 | 4.134763e+00 | 5.149513e-01 | 0.0 | ... | 9.511355e-01 | 6.000884e+00 | 3.722848e+00 | 2.286799e+00 | 5.094218e+00 | 6.620228e-01 | 3.105420e+01 | 6.492280e+00 | 6.834072e+00 | 6.569493e+00 |
8 rows × 296 columns
pd.DataFrame(x_test).describe()
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 286 | 287 | 288 | 289 | 290 | 291 | 292 | 293 | 294 | 295 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 471.000000 | 471.000000 | 471.000000 | 471.000000 | 471.000000 | 471.000000 | 471.000000 | 471.000000 | 471.000000 | 471.0 | ... | 471.000000 | 471.000000 | 471.000000 | 471.000000 | 471.000000 | 471.000000 | 471.000000 | 471.000000 | 471.000000 | 471.000000 |
| mean | -0.020117 | -0.000672 | -0.064074 | -0.011446 | -0.012522 | -0.016861 | 0.000378 | 0.071870 | 0.049538 | 0.0 | ... | 0.067164 | 0.013524 | -0.086499 | -0.017880 | 0.044480 | 0.047249 | 0.038045 | -0.036585 | -0.037811 | -0.055773 |
| std | 1.075556 | 0.980460 | 1.312291 | 1.054170 | 0.881602 | 0.767404 | 0.940772 | 1.153310 | 0.209756 | 0.0 | ... | 0.947260 | 0.998487 | 1.335864 | 0.962501 | 1.093258 | 0.221912 | 1.456030 | 0.960685 | 0.825645 | 0.894104 |
| min | -15.354795 | -13.366107 | -11.522168 | -3.068868 | -0.075382 | -2.701290 | -2.599068 | -3.932596 | -0.460810 | 0.0 | ... | -2.118332 | -1.300661 | -13.584310 | -4.037256 | -0.268153 | -1.406253 | -0.581300 | -3.076150 | -1.334835 | -1.045517 |
| 25% | -0.184281 | -0.156260 | -0.026985 | -0.679289 | -0.058584 | -0.542197 | -0.618463 | -0.660520 | -0.106722 | 0.0 | ... | 0.357269 | -0.349040 | -0.022734 | -0.440097 | -0.235482 | -0.094191 | -0.244823 | -0.610101 | -0.611827 | -0.536612 |
| 50% | 0.010920 | 0.080375 | 0.086767 | -0.258872 | -0.053161 | -0.035028 | -0.003792 | 0.072876 | 0.076666 | 0.0 | ... | 0.486562 | -0.320828 | 0.076617 | 0.033677 | -0.224631 | 0.060930 | -0.090044 | -0.070528 | -0.190972 | -0.297988 |
| 75% | 0.243534 | 0.290570 | 0.175494 | 0.470343 | -0.049504 | 0.553392 | 0.565347 | 0.716008 | 0.208152 | 0.0 | ... | 0.493511 | 0.452296 | 0.204623 | 0.673343 | -0.205792 | 0.190197 | 0.084923 | 0.461051 | 0.289235 | 0.076201 |
| max | 1.800484 | 1.945694 | 0.689515 | 5.160182 | 19.079201 | 1.668647 | 3.202902 | 5.985179 | 0.461896 | 0.0 | ... | 0.834982 | 6.177753 | 2.482748 | 2.286799 | 4.989766 | 0.662023 | 30.704269 | 6.492280 | 3.672263 | 6.569493 |
8 rows × 296 columns
# sns.countplot(data=y_train_s)
f,axes=plt.subplots(1,3,figsize=(17,7))
pd.value_counts(y_train_s).plot.bar(ax=axes[0])
pd.value_counts(y_test_s).plot.bar(ax=axes[1])
pd.value_counts(y_s).plot.bar(ax=axes[2])
<AxesSubplot:>
Here we can see that the training, testing, and original data have similar statistical distributions.
The pass/fail counts are comparable across the training, testing, and full sampled data.
Hence we can say that the train and test sets have statistical characteristics similar to the original data.
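Beyond eyeballing `describe()` tables, a two-sample Kolmogorov–Smirnov test gives a per-feature check that train and test plausibly came from the same distribution. A hedged sketch on synthetic stand-in data (assumes scipy is available):

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 4))    # stand-in for the sensor features

X_tr, X_te = train_test_split(X, test_size=0.3, random_state=1)

# per-feature KS test: a large p-value means "no evidence that the
# train and test distributions of this feature differ"
pvals = [ks_2samp(X_tr[:, j], X_te[:, j]).pvalue for j in range(X.shape[1])]
print(sum(p > 0.05 for p in pvals), "of", len(pvals), "features look similar")
```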
• Model training:
- Pick up a supervised learning model.
- Train the model.
- Use cross validation techniques.
Hint: Use all CV techniques that you have learnt in the course.
- Apply hyper-parameter tuning techniques to get the best accuracy.
Suggestion: Use all possible hyper parameter combinations to extract the best accuracies.
- Use any other technique/method which can enhance the model performance.
Hint: Dimensionality reduction, attribute removal, standardisation/normalisation, target balancing etc.
- Display and explain the classification report in detail.
- Design a method of your own to check if the achieved train and test accuracies might change if a different sample population can lead to new train and test accuracies.
Hint: You can use your concepts learnt under Applied Statistics module.
- Apply the above steps for all possible models that you have learnt so far.
• Display and compare all the models designed with their train and test accuracies.
• Select the final best trained model along with your detailed comments for selecting this model.
• Pickle the selected model for future use.
• Import the future data file. Use the same to perform the prediction using the best chosen model from above. Display the prediction results.
• Model training:
- Pick up a supervised learning model.
- Train the model.
- Use cross validation techniques.
Hint: Use all CV techniques that you have learnt in the course.
- Apply hyper-parameter tuning techniques to get the best accuracy.
Suggestion: Use all possible hyper parameter combinations to extract the best accuracies.
- Use any other technique/method which can enhance the model performance.
Hint: Dimensionality reduction, attribute removal, standardisation/normalisation, target balancing etc.
- Display and explain the classification report in detail.
- Design a method of your own to check if the achieved train and test accuracies might change if a different sample population can lead to new train and test accuracies.
Hint: You can use your concepts learnt under Applied Statistics module.
- Apply the above steps for all possible models that you have learnt so far.
- Pick up a supervised learning model.
- Train the model.
log=LogisticRegression(max_iter=1000,random_state=42)
k=KNeighborsClassifier()
gbc =GradientBoostingClassifier(random_state=42)
svc = SVC(random_state=42)
rfc = RandomForestClassifier(random_state=42,n_estimators=10)
xg_clf = XGBClassifier(colsample_bytree = 0.3, learning_rate = 0.1,
max_depth = 5, alpha = 10, n_estimators = 50,random_state=42)
lgb = LGBMClassifier(random_state=42)
algorithms=[k,log,gbc,rfc,svc,xg_clf,lgb]
names=['KNeighborsClassifier','Logistic','GradientBoost','RandomForest','SVC','xgboost','LGB']
- Use cross validation techniques.
- Apply hyper-parameter tuning techniques to get the best accuracy.
def crossvalidate(clf, X, y):
scores={'ACCURACY':'accuracy',
'RECALL':'recall',
'PRECISION':'precision',
'F1':'f1'}
model = Pipeline([
("scaling", sc),
('classification', clf)])
skf = StratifiedKFold(n_splits=5,random_state=None,shuffle=True)
crossvalidate = cross_validate(model, X, y, scoring= scores,cv=skf)
score=pd.DataFrame(crossvalidate).mean()
return score
clf=LogisticRegression(max_iter=1000,random_state=42)
crossvalidate(clf, x_train_s, y_train_s)
fit_time          0.070765
score_time        0.007283
test_ACCURACY     0.883778
test_RECALL       0.945584
test_PRECISION    0.842808
test_F1           0.891139
dtype: float64
Logistic regression:
Here we can see that the accuracy is about 88.4%, which is fair.
Recall is higher than precision here. Precision quantifies how many of the positive-class predictions actually belong to the positive class, while recall quantifies how many of all positive examples in the dataset were correctly predicted.
The F1 score here is fair.
Let's iterate over the other models and compare them.
def cross_vali_fit_pred(X_train, y_train, algorithms = algorithms, names = names):
    rows=[]
    for algo in algorithms:
        scr=crossvalidate(algo, X_train, y_train)
        rows.append(scr[2:].sort_index()) # drop fit_time/score_time; order metrics alphabetically
    # DataFrame.append is deprecated (removed in pandas 2.0), so build the frame in one go
    metrics=pd.DataFrame(rows, index=names)
    metrics.columns=['ACCURACY','F1','PRECISION','RECALL']
    return metrics.sort_values(by=['F1'],ascending=False)
cross_vali_fit_pred(x_train_s, y_train_s, algorithms = algorithms, names = names)
[17:10:53] WARNING: /Users/runner/work/xgboost/xgboost/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior. (warning repeated once per CV fold)
| ACCURACY | F1 | PRECISION | RECALL | |
|---|---|---|---|---|
| SVC | 0.988282 | 0.988475 | 0.978162 | 0.999029 |
| LGB | 0.980954 | 0.981000 | 0.982452 | 0.979588 |
| GradientBoost | 0.963376 | 0.963648 | 0.961401 | 0.965982 |
| RandomForest | 0.954585 | 0.954212 | 0.967200 | 0.941691 |
| xgboost | 0.944328 | 0.945152 | 0.938593 | 0.952366 |
| Logistic | 0.888187 | 0.894753 | 0.848569 | 0.946583 |
| KNeighborsClassifier | 0.610360 | 0.720632 | 0.563304 | 1.000000 |
Here we can see that SVC and LGB give the best results.
SVC and LGB have better accuracy, F1 score, precision, and recall than the other models.
SVC F1 score = 0.988475; LGB F1 score = 0.981000
Let us use these two models for further comparison and analysis.
from sklearn.model_selection import RandomizedSearchCV
def Randomsearch_Gridsearch(clf, X_train,Y_train,X_test,Y_test,params,cv_type):
model = Pipeline([
("scaling", sc),
('classification', clf)])
if cv_type == 'random':
CV=RandomizedSearchCV(model, params, cv=5,scoring='f1',n_iter=20,random_state=42)
elif cv_type == 'grid':
CV = GridSearchCV(model, params, cv=5,scoring='f1')
CV.fit(X_train, Y_train)
y_pred=CV.predict(X_test)
f1score=f1_score(Y_test, y_pred, average='macro')
return CV,y_pred,f1score
params=[{'classification__C': [0.1, 1, 10, 100, 1000],
'classification__gamma': [1, 0.1, 0.01, 0.001, 0.0001],
'classification__kernel': ['rbf']},
{
'classification__n_estimators': [400, 700, 1000],
'classification__colsample_bytree': [0.7, 0.8],
'classification__max_depth': [15,20,25],
'classification__num_leaves': [50, 100, 200],
'classification__reg_alpha': [1.1, 1.2, 1.3],
'classification__reg_lambda': [1.1, 1.2, 1.3],
'classification__min_split_gain': [0.3, 0.4],
'classification__subsample': [0.7, 0.8, 0.9],
'classification__subsample_freq': [20]
}]
classifiers=[('SVC',
SVC(random_state=42)),
('LGB Classifier',
LGBMClassifier(random_state=42))]
for param, classifier in zip(params, classifiers):
print("Working on {}...".format(classifier[0]))
clf,y_pred,f1score = Randomsearch_Gridsearch(classifier[1], x_train_s, y_train_s,x_test_s,y_test_s, param,'random')
print("Best parameter for {} is {}".format(classifier[0], clf.best_params_))
#
print("Best `F1` for {} is {}".format(classifier[0], f1score))
print('-'*50)
print('\n')
Working on SVC...
Best parameter for SVC is {'classification__kernel': 'rbf', 'classification__gamma': 0.01, 'classification__C': 1000}
Best `F1` for SVC is 0.9988609281489563
--------------------------------------------------
Working on LGB Classifier...
Best parameter for LGB Classifier is {'classification__subsample_freq': 20, 'classification__subsample': 0.7, 'classification__reg_lambda': 1.3, 'classification__reg_alpha': 1.1, 'classification__num_leaves': 50, 'classification__n_estimators': 700, 'classification__min_split_gain': 0.4, 'classification__max_depth': 15, 'classification__colsample_bytree': 0.7}
Best `F1` for LGB Classifier is 0.9703718701491529
--------------------------------------------------
We can see that with parameters 'classification__kernel': 'rbf', 'classification__gamma': 0.01, 'classification__C': 1000, we get an F1 of 0.9988609281489563 for SVC.
We can see that with parameters 'classification__subsample_freq': 20, 'classification__subsample': 0.7, 'classification__reg_lambda': 1.3, 'classification__reg_alpha': 1.1, 'classification__num_leaves': 50, 'classification__n_estimators': 700, 'classification__min_split_gain': 0.4, 'classification__max_depth': 15, 'classification__colsample_bytree': 0.7, we get an F1 of 0.9703718701491529 for the LGB classifier.
Hence, from the above analysis, we can infer that SVC gives the best results among the models considered.
# Commenting as grid search is taking too much time and results is similar to Randomsearch
# Since grid search is taking longer time for lgb, lets consider only SVC for the grid search purpose
param = params[0]
classifier = classifiers[0]
print("Working on {}...".format(classifier[0]))
clf,y_pred,f1score = Randomsearch_Gridsearch(classifier[1], x_train_s, y_train_s,x_test_s,y_test_s, param,'grid')
print("Best parameter for {} is {}".format(classifier[0], clf.best_params_))
#
print("Best `F1` for {} is {}".format(classifier[0], f1score))
print('-'*50)
print('\n')
Working on SVC...
Best parameter for SVC is {'classification__C': 10, 'classification__gamma': 0.01, 'classification__kernel': 'rbf'}
Best `F1` for SVC is 0.9988609281489563
--------------------------------------------------
Grid search CV also gives similar results, but randomized search has a much shorter execution time.
Grid search CV takes far more time to run.
We have already tried attribute removal, standardisation/normalisation, and target balancing. Let's now try dimensionality reduction.
pca = PCA().fit(x_train_s)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
[<matplotlib.lines.Line2D at 0x7fa4dffc6af0>]
#Based on the plot, we will select 25 components
pca = PCA(n_components=25)
pca.fit(x_train_s)
#Assign the components to the X variable
x_transformed = pca.transform(x_s)
x_train_res_pca, x_test_res_pca, y_train_res_pca, y_test_res_pca = train_test_split(x_transformed, y_s, test_size=0.30, random_state=1)
x_transformed.shape
(2926, 25)
x_train_res_pca_s = sc.fit_transform(x_train_res_pca)
x_test_res_pca_s = sc.transform(x_test_res_pca)
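Here PCA is fit on the resampled training matrix and then applied before re-splitting. Folding PCA into the pipeline (like the scaler) keeps its fit confined to the training folds during cross-validation. A sketch on synthetic data, reusing the pipeline step names from above; `n_components=0.95` keeps enough components to explain 95% of the variance instead of hard-coding a count like 25:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=50, n_informative=10,
                           random_state=42)

pipe = Pipeline([("scaling", StandardScaler()),
                 ("pca", PCA(n_components=0.95)),   # keep 95% of the variance
                 ("classification", SVC(kernel="rbf", random_state=42))])

scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```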
for param, classifier in zip(params, classifiers):
print("Working on {}...".format(classifier[0]))
clf,y_pred,f1score = Randomsearch_Gridsearch(classifier[1], x_train_res_pca_s, y_train_res_pca,x_test_res_pca_s,y_test_res_pca, param,'random')
print("Best parameter for {} is {}".format(classifier[0], clf.best_params_))
#
print("Best `F1` for {} is {}".format(classifier[0], f1score))
print('-'*50)
print('\n')
Working on SVC...
Best parameter for SVC is {'classification__kernel': 'rbf', 'classification__gamma': 0.1, 'classification__C': 100}
Best `F1` for SVC is 0.9931661451766062
--------------------------------------------------
Working on LGB Classifier...
Best parameter for LGB Classifier is {'classification__subsample_freq': 20, 'classification__subsample': 0.9, 'classification__reg_lambda': 1.1, 'classification__reg_alpha': 1.1, 'classification__num_leaves': 100, 'classification__n_estimators': 1000, 'classification__min_split_gain': 0.3, 'classification__max_depth': 20, 'classification__colsample_bytree': 0.8}
Best `F1` for LGB Classifier is 0.9646913378451831
--------------------------------------------------
After reducing the dimension from 296 to 25, we are getting f1 score of 0.9931661451766062 for SVC. This is good enough f1 score
We were able to reduce the dimension significantly here.
- Display and explain the classification report in detail.
def svm_analysis(x_train,y_train,x_test,y_test):
clf = svm.SVC(gamma=0.1, C=100,kernel = 'rbf')
############################################################################################
# Design and train an SVM classifier
############################################################################################
clf.fit(x_train, y_train)
prediction = clf.predict(x_test)
print('Prediction: {}'.format(prediction))
############################################################################################
# Display the classification accuracies for train and test data.
############################################################################################
print('With SVM accuracy of train data is: ',clf.score(x_train,y_train)) # accuracy
print('With SVM accuracy of test data is: ',clf.score(x_test,y_test)) # accuracy
############################################################################################
#Display and explain the classification report in detail.
############################################################################################
# Confusion Matrix
from sklearn import metrics
predicted_labels = clf.predict(x_test)
print("Confusion Matrix")
cm=metrics.confusion_matrix(y_test, predicted_labels, labels=[-1,1])
df_cm = pd.DataFrame(cm, index = [i for i in [-1,1]],
columns = [i for i in ["Predict pass", "Predict Fail"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True)
# Classification Report
print('\n{}'.format(classification_report(y_test, prediction)))
return clf
svm_analysis(x_train_res_pca_s,y_train_res_pca,x_test_res_pca_s,y_test_res_pca)
Prediction: [-1 -1 1 1 1 -1 1 1 1 -1 1 1 1 1 1 -1 1 1 -1 -1 1 1 1 1
1 1 -1 -1 -1 1 1 1 -1 -1 1 -1 -1 1 -1 1 -1 -1 1 1 -1 1 1 1
-1 1 1 -1 -1 1 1 1 1 -1 1 -1 1 1 -1 -1 1 1 1 -1 1 1 1 -1
-1 -1 -1 -1 -1 -1 1 -1 1 1 -1 1 1 -1 1 -1 -1 -1 1 -1 -1 1 1 -1
-1 1 1 1 1 1 -1 1 -1 1 1 1 -1 1 -1 1 -1 1 -1 -1 1 1 -1 -1
1 -1 1 -1 -1 -1 1 1 1 1 1 1 -1 -1 1 -1 1 -1 1 1 1 -1 -1 -1
-1 -1 1 -1 1 1 -1 1 -1 -1 1 -1 1 -1 -1 -1 1 1 -1 -1 1 1 1 1
1 1 1 1 -1 1 -1 1 1 -1 -1 1 -1 -1 1 1 1 1 -1 -1 -1 1 -1 1
-1 1 -1 1 1 1 -1 1 1 -1 -1 1 1 -1 -1 -1 1 -1 -1 -1 1 1 1 1
-1 -1 -1 -1 -1 -1 1 1 -1 1 -1 -1 1 1 -1 -1 -1 1 1 -1 -1 -1 -1 -1
1 -1 -1 -1 1 1 1 1 -1 -1 -1 1 -1 -1 -1 1 -1 1 1 1 1 -1 1 -1
1 1 1 1 -1 -1 1 1 1 -1 -1 1 1 1 1 -1 1 1 1 -1 1 1 -1 -1
-1 1 1 1 -1 1 1 -1 1 1 1 -1 -1 1 1 1 -1 -1 1 -1 -1 1 -1 -1
1 -1 -1 1 -1 -1 1 -1 -1 1 -1 -1 -1 -1 -1 -1 -1 -1 1 1 1 1 -1 1
1 1 -1 -1 -1 -1 -1 1 -1 -1 1 -1 -1 1 1 1 -1 1 1 1 1 1 -1 1
1 -1 1 1 -1 -1 1 1 -1 -1 1 1 -1 1 1 1 1 -1 1 1 1 1 -1 -1
-1 1 1 -1 -1 -1 1 -1 -1 1 1 -1 1 -1 1 -1 -1 1 1 1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 -1 1 1 -1 -1 -1 -1 1 1 1 1 -1 -1 -1 1 -1 1 1
-1 -1 1 -1 1 1 1 -1 -1 -1 -1 1 1 -1 1 1 1 1 -1 1 -1 -1 -1 -1
1 1 1 -1 -1 -1 -1 -1 -1 1 -1 1 -1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 1
-1 1 -1 1 -1 -1 -1 -1 1 1 1 -1 -1 -1 -1 -1 1 -1 1 -1 -1 1 1 1
-1 -1 1 -1 -1 -1 1 1 1 1 -1 1 -1 -1 -1 -1 -1 -1 1 -1 -1 1 -1 -1
1 1 -1 -1 1 1 1 -1 -1 1 -1 1 1 -1 -1 -1 1 -1 1 -1 -1 -1 1 1
-1 -1 1 -1 1 1 -1 1 1 1 -1 -1 1 -1 1 1 -1 -1 1 -1 -1 -1 1 -1
-1 1 1 1 1 -1 -1 -1 -1 1 1 -1 -1 1 1 1 1 1 -1 -1 1 1 1 -1
1 -1 -1 1 1 -1 1 1 1 -1 -1 -1 -1 -1 -1 -1 1 1 1 1 -1 -1 -1 1
1 -1 -1 1 -1 -1 -1 1 1 -1 -1 -1 -1 1 -1 -1 1 -1 1 1 -1 -1 1 1
-1 -1 -1 1 1 -1 1 1 1 -1 -1 1 1 -1 -1 1 1 1 1 -1 -1 -1 1 1
1 1 1 1 -1 -1 -1 1 1 1 -1 -1 1 1 -1 -1 -1 -1 1 -1 1 1 1 -1
1 1 1 -1 1 1 1 -1 1 -1 1 -1 1 -1 -1 1 -1 -1 -1 1 1 1 -1 -1
1 -1 1 1 1 -1 1 -1 1 -1 1 1 -1 1 -1 -1 -1 1 -1 -1 -1 -1 -1 1
1 -1 -1 -1 -1 1 -1 -1 1 -1 1 1 -1 1 1 -1 1 1 1 1 -1 -1 -1 1
1 1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 1 -1 -1 1 1 -1 1
1 1 -1 1 1 1 1 -1 1 -1 1 1 -1 -1 1 1 1 1 -1 -1 -1 -1 1 -1
-1 1 -1 -1 1 1 -1 1 -1 -1 -1 1 1 1 1 -1 1 1 1 1 1 1 1 1
-1 -1 -1 1 -1 -1 1 -1 1 -1 1 -1 -1 1 -1 1 -1 -1 -1 -1 -1 -1 1 -1
1 -1 -1 1 1 -1 1 -1 -1 -1 -1 1 1 1]
With SVM accuracy of train data is: 1.0
With SVM accuracy of test data is: 0.9931662870159453
Confusion Matrix
precision recall f1-score support
-1 1.00 0.99 0.99 444
1 0.99 1.00 0.99 434
accuracy 0.99 878
macro avg 0.99 0.99 0.99 878
weighted avg 0.99 0.99 0.99 878
SVC(C=100, gamma=0.1)
After the dimensionality reduction, the results are:
With SVM accuracy of train data is: 1.0
With SVM accuracy of test data is: 0.9931662870159453
Confusion Matrix
precision recall f1-score support
-1 1.00 0.99 0.99 444
1 0.99 1.00 0.99 434
accuracy 0.99 878
macro avg 0.99 0.99 0.99 878
weighted avg 0.99 0.99 0.99 878
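All the numbers in the classification report come straight from the confusion matrix. A tiny worked example with hypothetical labels, showing how precision, recall, and F1 for the fail class (1) are derived:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = np.array([-1, -1, -1, 1, 1, 1, 1])
y_pred = np.array([-1, -1,  1, 1, 1, 1, -1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[-1, 1]).ravel()
# precision = TP / (TP + FP): of everything predicted "fail", how much really failed
# recall    = TP / (TP + FN): of the true fails, how many were caught
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)   # 0.75 0.75 0.75
```

These match sklearn's own `precision_score`/`recall_score` with `pos_label=1`; the report's "support" column is simply the number of true examples of each class.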
- Design a method of your own to check if the achieved train and test accuracies might change if a different sample population can lead to new train and test accuracies.
Feature Engineering
clf = RandomForestClassifier(n_estimators=50, max_features='sqrt') # 'auto' is deprecated; 'sqrt' is the equivalent classifier default
clf= clf.fit(x_train,y_train)
feat_importances = pd.Series(clf.feature_importances_, index=x_s.columns)
print(feat_importances.nlargest(10))
color = list('rgbkymc')
feat_importances.nlargest(10).plot(kind='barh',color=color)
plt.title('Feature Importance')
plt.show()
59     0.016501
210    0.014281
71     0.013464
500    0.012461
102    0.012329
99     0.011464
218    0.011245
21     0.010535
121    0.010216
33     0.009698
dtype: float64
Here we can see that feature 59 has the highest importance; the features above are the most important among all the features.
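Impurity-based random-forest importances can be biased toward high-variance features; permutation importance on held-out data is a common cross-check. A hedged sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# shuffle each feature on the TEST set and measure the accuracy drop;
# features whose shuffling hurts most matter most
result = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=42)
top = result.importances_mean.argsort()[::-1][:3]
print("top features by permutation importance:", top)
```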
We have already used the stratified k-fold technique and found the results below.
Here we tried out different training and validation splits via stratified k-fold.
| MODEL | ACCURACY | F1 | PRECISION | RECALL |
|---|---|---|---|---|
| SVC | 0.985839 | 0.986116 | 0.972641 | 1.000000 |
| LGB | 0.980471 | 0.980428 | 0.985463 | 0.975714 |
| GradientBoost | 0.964353 | 0.964863 | 0.956153 | 0.973758 |
| RandomForest | 0.950193 | 0.949738 | 0.963232 | 0.936851 |
| xgboost | 0.947269 | 0.947777 | 0.943404 | 0.952399 |
| Logistic | 0.881832 | 0.889449 | 0.839742 | 0.945546 |
| KNeighborsClassifier | 0.610844 | 0.720852 | 0.563549 | 1.000000 |
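A simple way to answer the "different sample population" question is to repeat the train/test split with several random seeds and look at the spread of test accuracies; a narrow spread suggests the reported score is stable. A sketch on synthetic stand-in data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

accs = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    accs.append(model.score(X_te, y_te))

# the mean +/- std over resamples approximates how much the accuracy
# would move under a different sample population
print(f"test accuracy: {np.mean(accs):.3f} +/- {np.std(accs):.3f}")
```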
We also tuned parameters over different training and testing splits through randomized search and grid search CV.
We have tried feature engineering and dimensionality reduction as well.
• Display and compare all the models designed with their train and test accuracies.
• Select the final best trained model along with your detailed comments for selecting this model.
Reference:
Precision: When it predicts the positive result, how often is it correct? i.e. limit the number of false positives.
Recall: When it is actually the positive result, how often does it predict correctly? i.e. limit the number of false negatives.
Precision quantifies the number of positive class predictions that actually belong to the positive class. Recall quantifies the number of positive class predictions made out of all positive examples in the dataset.
| MODEL | ACCURACY | F1 | PRECISION | RECALL |
|---|---|---|---|---|
| SVC | 0.987792 | 0.987989 | 0.977201 | 0.999029 |
| LGB | 0.979974 | 0.980092 | 0.980860 | 0.979583 |
| GradientBoost | 0.964846 | 0.965202 | 0.959684 | 0.970860 |
| RandomForest | 0.953607 | 0.953359 | 0.964430 | 0.942652 |
| xgboost | 0.948732 | 0.949334 | 0.942626 | 0.956278 |
| Logistic | 0.877438 | 0.885142 | 0.836867 | 0.939749 |
| KNeighborsClassifier | 0.618148 | 0.724786 | 0.568462 | 1.000000 |
We have tried the steps below; the conclusions are:
We trained the above list of models and found that SVM and LightGBM give the best results in terms of precision, recall, accuracy, and F1 score, as seen in the table above.
We used different cross-validation and search techniques:
- Grid search CV
- Randomized search CV
We averaged the results and published them in the table above.
Here also, SVC and LGB gave the best results compared to the other models.
We used hyperparameter tuning to find the best parameters:
SVC...
Best parameter for SVC is {'classification__kernel': 'rbf', 'classification__gamma': 0.01, 'classification__C': 1000}
Best F1 for SVC is 0.9988609281489563
LGB Classifier...
Best parameter for LGB Classifier is {'classification__subsample_freq': 20, 'classification__subsample': 0.7, 'classification__reg_lambda': 1.3, 'classification__reg_alpha': 1.1, 'classification__num_leaves': 50, 'classification__n_estimators': 700, 'classification__min_split_gain': 0.4, 'classification__max_depth': 15, 'classification__colsample_bytree': 0.7}
Best F1 for LGB Classifier is 0.9703718701491529
After reducing the dimension from 296 to 25, we are getting f1 score of 0.9931661451766062 for SVC. This is good enough f1 score
We were able to reduce the dimension significantly here.
We removed highly correlated data with collinearity greater than 0.7
We removed the columns with more than 25% missing data
We standardised the data and scaled it.
We balanced the target classes with SMOTE, and verified that the statistical characteristics of the training, testing and original data remained consistent.
We used the stratified k-fold technique to evaluate the models, trying out different training and testing splits.
We also tuned the models over a range of parameter values on these splits using RandomizedSearchCV and GridSearchCV.
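Stratified k-fold evaluation of the tuned SVC pipeline can be sketched as follows; the synthetic data is a stand-in, while the SVC parameters are the tuned values reported above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the reduced sensor data.
X, y = make_classification(n_samples=200, n_features=25, random_state=3)

pipe = Pipeline([("scaling", StandardScaler()),
                 ("classification", SVC(kernel="rbf", gamma=0.01, C=1000))])

# Stratified folds preserve the pass/fail ratio in every train/test split.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=3)
scores = cross_val_score(pipe, X, y, cv=skf, scoring="f1")
print("fold F1 scores:", np.round(scores, 3))
print("mean F1: %.3f" % scores.mean())
```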
With SVM accuracy of train data is: 1.0
With SVM accuracy of test data is: 0.9931662870159453
Classification Report
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| -1 | 1.00 | 0.99 | 0.99 | 444 |
| 1 | 0.99 | 1.00 | 0.99 | 434 |
| accuracy | | | 0.99 | 878 |
| macro avg | 0.99 | 0.99 | 0.99 | 878 |
| weighted avg | 0.99 | 0.99 | 0.99 | 878 |
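A report of this shape is produced by scikit-learn's `classification_report`; a minimal sketch on illustrative labels (not the project's actual predictions), with -1 = pass and 1 = fail as in the dataset:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Illustrative labels only; -1 = pass, 1 = fail as in the dataset.
y_test = np.array([-1, -1, -1, -1, 1, 1, 1, 1])
y_pred = np.array([-1, -1, -1, 1, 1, 1, 1, 1])

# Confusion matrix rows are true labels, columns are predicted labels.
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```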
• Pickle the selected model for future use.
• Import the future data file. Use the same to perform the prediction using the best chosen model from above. Display the prediction results.
# Final model: StandardScaler + RBF-kernel SVC with the tuned parameters
svm_model = svm.SVC(gamma=0.01, C=1000, kernel='rbf')
final_model = Pipeline([
    ("scaling", sc),                 # sc: the StandardScaler fitted earlier
    ('classification', svm_model)])
final_model.fit(x_train_s, y_train_s)
y_pred = final_model.predict(x_test_s)
print('With SVM accuracy of train data is: ', final_model.score(x_train_s, y_train_s))  # train accuracy
print('With SVM accuracy of test data is: ', final_model.score(x_test_s, y_test_s))     # test accuracy
With SVM accuracy of train data is: 1.0
With SVM accuracy of test data is: 0.9988610478359908
# save the model to disk
pickle.dump(final_model, open("finalized_model.sav", 'wb'))
def predict_result(model_file, data_file):
    # load the pickled model from disk
    loaded_model = pickle.load(open(model_file, 'rb'))
    # keep only the feature columns the model was trained on
    numerical_ix = x_train_s.columns.astype(int)
    test_data = pd.read_excel(data_file)
    test_data = test_data[numerical_ix]
    # impute remaining missing values with 0 and cast to float,
    # matching the training-time preprocessing
    test_data = test_data.replace(np.nan, 0)
    test_data = test_data.astype(float)
    prediction = loaded_model.predict(test_data)
    print('Prediction: {}'.format(prediction))
predict_result("finalized_model.sav","Future_predictions.xlsx")
Prediction: [-1 -1 1 -1 -1 -1 -1 -1 -1 -1 1 1 -1 -1 1 -1 -1 -1]
Here, by calling the predict_result function with the saved model file and the future data file, we obtain the predicted results.
• Write your conclusion on the results.
With SVM accuracy of train data is: 1.0
With SVM accuracy of test data is: 0.9931662870159453
Classification Report
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| -1 | 1.00 | 0.99 | 0.99 | 444 |
| 1 | 0.99 | 1.00 | 0.99 | 434 |
| accuracy | | | 0.99 | 878 |
| macro avg | 0.99 | 0.99 | 0.99 | 878 |
| weighted avg | 0.99 | 0.99 | 0.99 | 878 |
From all of these findings we can conclude that, as per our analysis, the SVC is the best model: it delivers consistently high precision, recall and F1 scores on the held-out test data, and it retains this performance even after the dimensionality reduction. Not all 591 features are required to build a good model.
import pandas_profiling

# Quick exploratory profile of the raw sensor data
df_profiling = pd.read_csv("signal-data.csv")
pandas_profiling.ProfileReport(df_profiling, minimal=True)